Fix LLM translation output pollution and refresh translations on body edits#283
Conversation
The translator used fetchLinkedPages as a required context source, which fetched up to 10 linked pages with their full Readability- extracted body and dumped everything into the system prompt as "Additional context: ...". For articles that link to several long technical posts the context easily ballooned to multiples of the source text size, and Claude began confusing the context for the text to translate. Concrete failures observed on @nebuleto/2026/swe-with-coding-agent-in-2026: the en-US row came back as the linked Anthropic blog post in English (~3.5x the original length), and the ja row came back entirely in Korean, paraphrasing the linked content. Switch to the passive fetchWebPage source instead. Vertana exposes passive sources as tools rather than pre-fetching them into the prompt, so the model only invokes a fetch when it actually decides to, which sidesteps the up-front pollution while still leaving a fetch capability on the table. The other context sources (author info, article tags) and Vertana options (refinement, dynamic glossary, technical tone) are unchanged. Existing translation rows that were generated under the old behavior remain in the database; they should be deleted so the auto-request path on /lang regenerates them with the new code. Assisted-by: Claude Code:claude-opus-4-7
Pure refactor of startArticleContentTranslation in models/article.ts. The on-conflict / staleness gate at the front (insert-or-find the placeholder row, decide whether work is needed) stays in the public function. The post-row-creation work — fetching author/tag context, running translate(), writing the result back, sending the federation Update activity, and kicking off post-translation summarization — moves into a new private async helper runArticleContentTranslation(fedCtx, queued). No behavior change. The synchronous setup phase (article-source fetch) is still awaited before the translate() chain is scheduled, so callers see exactly the same return timing as before, and the translate-then/catch chain still runs in the background. A defensive log-and-return guard is added when the helper is called with a row that has originalLanguage IS NULL (the original-language row itself); the existing public entry point never produces such a row, but a forthcoming caller will pass already-existing translation rows through this path. Assisted-by: Claude Code:claude-opus-4-7
Inside the new runArticleContentTranslation helper, take a JS-side claim Date as ownership stamp on the placeholder row before kicking off the translate() chain. The success UPDATE and the failure DELETE both gate on the row's `updated` still equalling that claim — a concurrent re-translation that resets the row out from under us bumps `updated` past the claim, and our writes turn into no-ops instead of clobbering the fresher placeholder with stale text or deleting it on a stale failure. A JS Date is used (rather than PG `CURRENT_TIMESTAMP`) because the postgres driver hands back JS `Date` values truncated to ms while PG `timestamptz` keeps µs precision; a CAS against a round-tripped `CURRENT_TIMESTAMP` write would never match its own claim. This mirrors the claim pattern already used by startArticleContentSummary in this file. The race this prevents only becomes reachable in practice with the forthcoming retranslation-on-edit code path, but the guard is self-contained and the existing failure-cleanup test still passes, so it lands ahead of that change as a pure correctness fix. Assisted-by: Claude Code:claude-opus-4-7
Edits to an article's body now invalidate every existing translation row for that article and re-runs the translator against the fresh original. Previously, translations stayed pinned to whatever text they were generated from, so /ja, /en, etc. silently drifted out of sync with the edited original. The trigger lives in two pieces: - updateArticleSource now returns an extra `originalContentChanged` flag set when the original-language `article_content` row's body actually changed. The flag mirrors the existing summary invalidation gate: title-only edits do not set it, so a typo fix on the title doesn't fan out N+1 LLM calls; the body-changed gate also catches the case the summary path already cared about. A successful `languageChanged` cannot reach this branch when translations exist (the self-FK on `article_content` re-throws 23503 as `LanguageChangeWithTranslationsError`), so the flag is effectively `contentChanged`. - A new restartArticleContentTranslations(fedCtx, articleSource) resets each translation row to placeholder state — copying the new original title/content into it, flipping `beingTranslated` back to true, and clearing summary state — then fires the extracted runArticleContentTranslation helper against each reset row in the background. updateArticle reads the new flag and awaits the synchronous claim-and-reset step so placeholders are visible by the time the mutation responds; the actual translate() calls run in the background, same as the initial translation flow. Each re-translation that completes sends its own federation Update so peers see the original-language change first and then each translation's refresh as it becomes available — the right ActivityPub semantics. The CAS guard added in the previous commit keeps an in-flight stale translation from clobbering the new placeholder with stale text on completion. The summary-test callers of updateArticleSource only assert truthiness on the return value, so the new return shape passes through them unchanged. The dedicated source-level test that previously asserted "preserves translations" splits into two: one for the body-change case (now asserting the flag flips and the existing row is preserved here for the outer caller to invalidate) and one for the title-only carve-out. New end-to-end tests in article.lifecycle.test.ts cover updateArticle's full body-change and title-only paths. Fixes hackers-pub#95 Assisted-by: Claude Code:claude-opus-4-7
Three fixes uncovered by an external code review of commits 302c242..422ee87. 1. Translate the freshest body, not the queued snapshot (models/article.ts). runArticleContentTranslation's claim step now reads the row back via RETURNING and uses `claimed.title` / `claimed.content` for the translate input. This closes a race where two restartArticleContentTranslations calls interleave: the older restart's runArticleContentTranslation would otherwise translate the older body even after the newer restart wrote the newer body into the placeholder. The claim WHERE also now requires `beingTranslated=true`, so the helper bails silently if the row has been completed (or deleted) between the caller queueing this run and the claim landing, instead of stamping a claim onto a no-longer-relevant row. 2. Make translation Update activity ids unique (models/article.ts). Previously every translation completion reused `#update/${article.updated.toISOString()}` as the activity id. `article_source.updated` is not bumped by translation completions, so the original-language Update and every per-locale translation Update for the same edit all collided. The id now incorporates the target language and the translated row's own (just-bumped) `updated` so each completion mints a distinct id, regardless of how many locales finish in close succession. 3. Catch background retranslation kickoff failures (models/article.ts). restartArticleContentTranslations did `void runArticleContentTranslation(...)`, which loses any rejection thrown by the synchronous setup before the internal `.then().catch()` chain is installed (e.g., a transient DB error during the claim UPDATE or the article-source fetch). Such rejections would surface as unhandled promise rejections in the runtime; they now go through a `.catch()` that logs the failed (sourceId, language) and the error. All 29 article-domain tests still pass. Assisted-by: Codex:gpt-5.5 Assisted-by: Claude Code:claude-opus-4-7
Wrap the read-original-then-reset-translations sequence in restartArticleContentTranslations in a transaction that takes `SELECT … FOR UPDATE` on the article_source row. Concurrent restartArticleContentTranslations calls (driven by back-to-back edits to the same article) would otherwise each read their own snapshot of the original-language body and then overwrite each other's placeholder writes — the older restart's iteration could land after the newer restart's, leaving the placeholder pointing at the older body and the eventual translation reflecting that older body even though a newer edit had committed. The article_source row lock is the same row-level write lock that updateArticleSource takes on its UPDATE, so concurrent edits and restarts queue up cleanly without any new lock graph. The actual translate() round-trip runs after the transaction commits — the runArticleContentTranslation calls are scheduled outside the tx block — so the LLM call doesn't extend the lock window. Follow-up to the previous commit (Address Codex review of the retranslation flow), which addressed three of four issues; this one closes the remaining stale-snapshot race that survived the freshest-body claim refactor in runArticleContentTranslation. Assisted-by: Codex:gpt-5.5 Assisted-by: Claude Code:claude-opus-4-7
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughupdateArticleSource now reports when the original body actually changed; updateArticle uses that flag to call restartArticleContentTranslations which resets LLM-requested translation rows to placeholders and re-queues translations with CAS ownership. runArticleContentTranslation uses JS-side claim stamps. ai/translate.ts stops eagerly fetching linked pages and exposes fetchWebPage as a callable tool. ChangesArticle translation restart and CAS
Sequence DiagramsequenceDiagram
participant User
participant API
participant DB
participant Translator
participant Summarizer
User->>API: updateArticle(newTitle?, newContent?)
API->>DB: updateArticleSource(new data)
DB-->>API: { source, originalContentChanged }
alt originalContentChanged == true and allowLlmTranslation
API->>DB: restartArticleContentTranslations(source)
DB->>DB: LOCK article_source FOR UPDATE
DB->>DB: SELECT translation rows (LLM-requested)
loop per translation row
DB->>DB: UPDATE row -> placeholder (copy source title/content, clear summary, set beingTranslated, stamp updated)
DB-->>API: queued
end
par async per queued row
API->>Translator: runArticleContentTranslation(rowId)
Translator->>DB: claim placeholder (updated := claim)
Translator->>Translator: translate from claimed.title/claimed.content
alt claim still valid
Translator->>DB: CAS UPDATE success (only if updated === claim)
DB-->>Translator: persisted
Translator->>Summarizer: schedule post-translation summary
else claim stale
Translator-->>DB: log & skip persistence
end
end
end
API-->>User: updateArticle response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 23 minutes and 28 seconds.Comment |
|
@codex review |
There was a problem hiding this comment.
Code Review
This pull request implements automatic re-translation of articles when the original content is updated. It introduces a restartArticleContentTranslations function that resets existing translation rows to a placeholder state and triggers background re-translation using a new runArticleContentTranslation helper. The implementation includes a concurrency control mechanism using a Check-and-Set (CAS) pattern on the updated timestamp and ensures unique ActivityPub update IDs for translation completions. Additionally, the translation logic was optimized to fetch linked pages lazily as tools. Feedback was provided regarding the duplication of the waitFor test helper across multiple test files.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f649b60f05
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
updateArticle() called restartArticleContentTranslations() on every body change without checking the article-level allowLlmTranslation flag. When an edit toggles the flag off and changes the body in the same update, the previous behavior would still enqueue background translate() runs for every existing translation row, overriding the author's just-expressed wish and incurring model usage / cost / data-egress for translations they no longer want. Gate the call on articleSource.allowLlmTranslation. Existing translation rows from before the flag was flipped off are left alone (stale, not refreshed); flipping the flag back on and editing the body again brings them back into sync. Adds a regression test covering the gate. hackers-pub#283 (comment) Assisted-by: Codex:gpt-5.5 Assisted-by: Claude Code:claude-opus-4-7
Both models/article.background.test.ts and models/article.lifecycle.test.ts had a private copy of the same async-polling helper that watches for background-promise side effects (translation completion, placeholder reset, summary cleanup) without exposing the underlying promise to the test. Move it to test/wait.ts so future tests don't grow a third copy when they hit the same observation pattern. Behavior is unchanged: same 50 ms poll cadence, same default 10 s deadline, same error message on timeout. hackers-pub#283 (comment) Assisted-by: Claude Code:claude-opus-4-7
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to automatically re-translate articles when the original body content is updated. Key changes include the addition of restartArticleContentTranslations to reset translation rows and runArticleContentTranslation to handle LLM interactions using a Compare-And-Swap pattern for concurrency safety. The translation pipeline was also refined to provide web context as a passive tool, preventing model confusion. Feedback focuses on optimizing the database interaction in restartArticleContentTranslations by replacing the loop of individual updates with a single batch UPDATE statement to improve performance when multiple translations exist.
|
Codex Review: Didn't find any major issues. Delightful! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
restartArticleContentTranslations() previously listed every translation row for the article via a relational findMany and then issued a per-row UPDATE inside the transaction. Each row gets the same placeholder values, so the loop is N+1 round trips against the same row set. Replace the findMany + per-row UPDATE loop with a single UPDATE filtered on (sourceId = articleSource.id AND originalLanguage IS NOT NULL). The filter targets exactly the same rows the previous implementation iterated; the schema check article_content_being_translated_check requires originalLanguage IS NOT NULL whenever beingTranslated=true so the filter doubles as the schema-safety guard the per-iteration comment used to explain. RETURNING gives us the reset rows for the post-commit fan-out into runArticleContentTranslation, the same shape callers depended on. Behavior is otherwise unchanged: still inside the same transaction that holds the article_source FOR UPDATE row lock, still preserves originalLanguage and translationRequesterId by omitting them from set, still bumps updated to CURRENT_TIMESTAMP. The "no rows to restart" early return is now implicit in the UPDATE returning an empty array; the surrounding logger calls are kept and reworded slightly for the new shape. All 30 article-domain tests still pass. hackers-pub#283 (comment) Assisted-by: Claude Code:claude-opus-4-7
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements automatic retranslation of articles when the original body is updated by introducing a mechanism to reset translation rows and trigger background processing. It includes a Compare-And-Swap (CAS) logic to handle concurrent updates and ensures unique ActivityPub Update IDs for translations. Feedback suggests clearing ogImageKey during resets to prevent stale assets and refining the CAS claim logic to avoid redundant LLM calls during rapid successive edits.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 59503306de
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
article_content.ogImageKey caches the rendered OG image for a row. The image is rendered from the row's title and content, so any write that replaces those fields makes the cached image stale: the crawler that follows the cached URL will see an image whose heading or excerpt no longer matches the live body. Add ogImageKey: null to the two write paths that replace a translation row's title/content: - restartArticleContentTranslations()'s reset UPDATE, which copies the new original title/content into every translation placeholder on a body edit. - runArticleContentTranslation()'s success write, which lands the newly translated title/content for one row. Both paths already clear the matching summary state (summary, summaryStarted, summaryUnnecessary) for the same reason; the OG image is the same kind of derived asset and gets cleared next to them. Lazy regeneration on the next OG-image request rebuilds from the freshly written body. The original-language UPDATE in updateArticleSource() is left alone for now; that staleness predates this PR and is the same question for the original row, not specifically a retranslation concern. hackers-pub#283 (comment) Assisted-by: Claude Code:claude-opus-4-7
runArticleContentTranslation()'s claim UPDATE only required beingTranslated=true on the row, so two background runs scheduled for the same (sourceId, language) by back-to-back edits would both pass the claim check and both call translate(). The success / failure CAS at the end still ensured only one write landed, but the second translate() round trip was wasted LLM cost. A related concern is that JS Date is millisecond-precision, so two claim values created in the same millisecond compare equal; the existing CAS happens to remain correct because the first success bumps `updated` to a new value that the second can no longer match, but the protocol felt easier to reason about with a stronger uniqueness story up front. Add `updated = queued.updated` to the claim WHERE so each run only claims the placeholder it was queued for. When a later restart re-stamps the row, the older run's claim turns into a no-op instead of going through to translate(). For this CAS to be lossless, every writer that produces a row the helper later treats as `queued` now stamps `updated` with a JS Date (which round-trips through the postgres driver at the same ms precision the JS side stores) instead of letting it default to PG's CURRENT_TIMESTAMP (which keeps µs and would lose precision on the round trip): - startArticleContentTranslation()'s INSERT: explicit updated: queueStamp. - startArticleContentTranslation()'s stuck-row recovery: an explicit re-stamp UPDATE before handing the row off to the helper, so the helper's CAS has a value to match. Returns early without scheduling translate() if a concurrent writer took the row out from under us between the SELECT and the re-stamp. - restartArticleContentTranslations()'s reset UPDATE: explicit updated: restartStamp instead of sql`CURRENT_TIMESTAMP`. The claim WHERE in runArticleContentTranslation() then matches this stamp exactly. All 30 article-domain tests still pass; the existing fixtures already insert with JS Date stamps, so no test changes are needed. hackers-pub#283 (comment) hackers-pub#283 (comment) Assisted-by: Codex:gpt-5.5 Assisted-by: Gemini Code Assist:gemini-2.5 Assisted-by: Claude Code:claude-opus-4-7
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements automatic retranslation of articles when the original content body is updated. It introduces a background retranslation pipeline using a Check-and-Set (CAS) mechanism with millisecond-precision timestamps to handle concurrent updates safely. The changes also include logic to skip retranslation for title-only edits and respect the allowLlmTranslation setting. Feedback was provided regarding the robustness of the Markdown title extraction logic, suggesting a more resilient parsing approach or stricter prompt instructions to handle potential variations in LLM output format.
|
Codex Review: Didn't find any major issues. You're on a roll. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
The success path of runArticleContentTranslation() split the LLM output into title and body with two strict regexes anchored at ^\s*#\s+, so any model output that didn't begin exactly with a `# ` H1 left title="" and dumped the entire raw output into content. The most likely cause in production was a model that prefaced the translation with a brief commentary line (a translator's note, hedging about preserved terms, etc.) before the H1. The previous commit's fetchWebPage switch made this much less common, but the parser was still fragile. Replace the inline regexes with a small splitTranslationTitleAndContent helper that: - Accepts the first H1 line anywhere in the output as the title, not just at offset 0. Any text before that H1 is dropped on the assumption it is model commentary, since the prompt did not contain anything before `# Title`. - Falls back to the first non-empty line as the title (with the rest as the body) if there is no H1 at all. - Restricts H1 detection to a single `#` followed by whitespace, so later `## Section` headings inside the body are not mis-promoted to the title. - Trims surrounding whitespace from both halves. The helper is exported and covered by a dedicated unit test file exercising clean H1 + body, H1 with preamble, no H1 at all, H2-only input, H1-only input, empty / whitespace-only input, and title-line whitespace trimming. hackers-pub#283 (comment) Assisted-by: Gemini Code Assist:gemini-2.5 Assisted-by: Claude Code:claude-opus-4-7
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements automatic retranslation of articles when their original content is updated. Key changes include the introduction of a restartArticleContentTranslations function, a more robust LLM output parser in splitTranslationTitleAndContent, and a concurrency-safe claim mechanism for background tasks using JS Date timestamps to handle database precision differences. The feedback identifies a redundant null-coalescing fallback for the source language in the translation routine that can be simplified due to existing guards and schema constraints.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f59757a238
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
restartArticleContentTranslations()'s reset filter selected every row with `originalLanguage IS NOT NULL`, which catches both LLM-requested translation rows (carrying `translationRequesterId`) and human-curated translation rows (carrying `translatorId`). On a body edit, a human translation would therefore be reset to the new source-language placeholder and re-run through the LLM, silently destroying the translator's work and leaving incorrect attribution metadata on machine-generated output. Add `translatorId IS NULL` to the WHERE so the restart only touches LLM-requested rows. The schema check `article_content_translator_translation_requester_id_check` already makes `translatorId` and `translationRequesterId` mutually exclusive, so the new predicate cleanly partitions human vs LLM translations without needing to also test `translationRequesterId IS NOT NULL`. Adds a regression test that creates one human translation and one LLM translation alongside it on the same article, edits the body, and asserts the human row is untouched while the LLM row is picked up by the restart. hackers-pub#283 (comment) Assisted-by: Codex:gpt-5.5 Assisted-by: Claude Code:claude-opus-4-7
The translate() input in runArticleContentTranslation() passed sourceLanguage as `claimed.originalLanguage ?? originalLanguage`, where the right-hand side was the destructured local from the caller-provided `queued`. The fallback was dead code: the claim WHERE required `beingTranslated=true`, and the schema check `article_content_being_translated_check` makes that imply `originalLanguage IS NOT NULL`, so `claimed.originalLanguage` is always set on a row that came back from the claim's RETURNING. Drop the fallback and use a non-null assertion on `claimed.originalLanguage` (drizzle types the column as nullable because it is nullable in general, but in this position the schema guarantees otherwise). A short comment captures the chain of guarantees so a future reader doesn't reach for the same defensive `??` again. hackers-pub#283 (comment) Assisted-by: Gemini Code Assist:gemini-2.5 Assisted-by: Claude Code:claude-opus-4-7
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@models/article.ts`:
- Around line 815-824: The reclaim update is not CAS-safe: the
db.update(articleContentTable).set({...}).where(...) only checks
articleContentTable.beingTranslated and can overwrite a fresh claim by another
worker; change the WHERE in the reclaim/update to require that
articleContentTable.updated (or the exact previous timestamp/value you read into
e.g. content.updated or currentUpdated) still equals the stale value you
observed, so the UPDATE only succeeds if no other worker changed the row, and
then check the returning() result (zero rows means another worker claimed it)
before proceeding with translate(); reference the reclaimed variable, the update
call on articleContentTable, and the articleContentTable.beingTranslated and
articleContentTable.updated fields.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 95f1839f-0218-44d4-9bbe-d147c553cec8
📒 Files selected for processing (5)
models/article.background.test.tsmodels/article.lifecycle.test.tsmodels/article.translation-split.test.tsmodels/article.tstest/wait.ts
✅ Files skipped from review due to trivial changes (1)
- models/article.translation-split.test.ts
There was a problem hiding this comment.
Code Review
This pull request implements automatic re-translation of articles when the original content is updated. It introduces a background processing pipeline that invalidates existing LLM-generated translations while preserving human-curated ones. Key changes include a new restartArticleContentTranslations function, a robust title/content splitter for LLM outputs, and a concurrency-safe claim system using timestamps. Feedback focuses on using idiomatic Drizzle methods for row locking and simplifying the logic for parsing translated content.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ff61bc24db
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The stuck-row reclaim branch in startArticleContentTranslation() SELECTed the row, decided it was stale (>30 min old and still beingTranslated), then UPDATEd it with only beingTranslated=true in the WHERE. A concurrent worker that landed its own reclaim in the gap between our SELECT and our UPDATE would have its fresh claim overwritten by ours, and both workers would then go on to call runArticleContentTranslation(), wasting one LLM round trip. Repeat the same staleness check inside the UPDATE itself (`updated < CURRENT_TIMESTAMP - INTERVAL '30 minutes'`). Under PG read-committed isolation, the second UPDATE waits for the first's row lock, then re-evaluates its WHERE on the now-fresh row state and drops the row from the candidate set, so the second worker's reclaim returns 0 rows and bails cleanly. Same predicate as the SELECT, so it stays in lockstep with the visibility logic above. hackers-pub#283 (comment) Assisted-by: CodeRabbit:cr-3 Assisted-by: Claude Code:claude-opus-4-7
restartArticleContentTranslations() acquired the article_source
row lock with raw SQL via tx.execute(sql\`SELECT 1 FROM ... FOR
UPDATE\`). Drizzle exposes the same primitive idiomatically as
.for("update") on a select query, which keeps column references
and parameter binding consistent with the rest of this file's
query style. Same lock, same semantics, no behavior change.
hackers-pub#283 (comment)
Assisted-by: Gemini Code Assist:gemini-2.5
Assisted-by: Claude Code:claude-opus-4-7
splitTranslationTitleAndContent() previously took the first H1 line anywhere in the output as the title, dropping any text before it on the assumption that it was model commentary. That handled a "preamble before # Title" case nicely but introduced a data-loss path: when the model omitted the article-title H1 and the body happened to contain its own H1 section heading, the parser silently truncated the article (mis-promoted body H1 to title, dropped the intro paragraphs above it). Silent truncation is a worse failure mode than a visible-but-ugly title. Anchor parsing to the first line of the trimmed output: - The first line becomes the title. - If it begins with `# `, the marker is stripped; otherwise the whole line is the title verbatim. - Everything after the first line becomes the body, untouched. The strict implementation also folds in the related cleanup of the prior fallback branch. Because `trimmed` is the result of `String#trim()`, it is guaranteed to either be empty or begin with a non-whitespace character, so the previous `findIndex` for the first non-empty line was redundant and is now removed. The H1-marker regex is restricted to a single `#` followed by whitespace so a first-line `## Section` is taken verbatim instead of being mis-stripped. Tests in models/article.translation-split.test.ts are updated to match the new behavior: the preamble case now keeps the preamble visible as the title with the body's `# Title` framing preserved, and the rest of the cases (clean H1+body, no-H1 fallback, first-line H2, body H2 sections, empty input, H1-only input, title trimming) continue to pass unchanged. hackers-pub#283 (comment) hackers-pub#283 (comment) Assisted-by: Codex:gpt-5.5 Assisted-by: Gemini Code Assist:gemini-2.5 Assisted-by: Claude Code:claude-opus-4-7
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements a mechanism to automatically re-trigger article translations when the original content's body is updated. It introduces a restartArticleContentTranslations function that resets existing LLM-generated translation rows to a placeholder state and re-runs the translation pipeline. Key improvements include a concurrency-safe Check-and-Set (CAS) mechanism using JavaScript Date objects for database updates, a more robust title/body splitting logic for LLM outputs, and updated ActivityPub activity IDs to ensure uniqueness across multiple translation completions. The PR also refactors test utilities into a shared waitFor helper and adds comprehensive test coverage for the new lifecycle logic. I have no feedback to provide.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5fd961d6e9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The stuck-row reclaim branch in startArticleContentTranslation() only bumped `updated` and left the row's `title` / `content` alone. That was fine when the helper translated from the caller-provided original content, but runArticleContentTranslation() now translates from `claimed.title` / `claimed.content` (the row's own state read back via RETURNING), so a placeholder that had been stuck since before a body edit would be retranslated using the OLD source text and publish a translation that no longer matches the article. Concrete failure path: an editor disables allowLlmTranslation, edits the body (no restart fires because of the gate), waits out the 30-min stale window, re-enables allowLlmTranslation, and a later requestArticleTranslation hits the stuck-recovery branch. Without this fix, the resulting translation reflects the body from before the edit. Extend the reclaim UPDATE to copy the caller-provided original title / content (and the matching language) into the placeholder, and clear the `summary` / `summaryStarted` / `summaryUnnecessary` / `ogImageKey` derived state for the same reason as restartArticleContentTranslations() does. The caller of startArticleContentTranslation() always passes the article's current original-language content row, which is the right source of truth at request time. hackers-pub#283 (comment) Assisted-by: Codex:gpt-5.5 Assisted-by: Claude Code:claude-opus-4-7
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements a mechanism to automatically invalidate and re-run translations for articles when their original-language content is updated. It introduces a new restartArticleContentTranslations function, updates updateArticleSource to track content changes, and adds robust testing with a new waitFor utility to handle asynchronous background translation state. I have no feedback to provide as the changes are well-structured and properly tested.
|
Codex Review: Didn't find any major issues. Bravo. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Two related fixes to the LLM-powered article translation feature, shipped together because they share the same surface in models/article.ts and the second one surfaces failure modes the first could amplify.
1. Translation output pollution
The translation pipeline passed Vertana's
fetchLinkedPagesas a required context source, which fetches up to ten linked pages with their full Readability-extracted body and dumps everything into the translator's system prompt asAdditional context: …. For articles that link to long technical posts on the same topic, the system-prompt context easily ballooned to multiples of the source-text size and the model began confusing the context for the text to translate. The concrete failure observed on@nebuleto/2026/swe-with-coding-agent-in-2026—a Korean-original article linking to Anthropic's engineering blog and JetBrains research—is that theen-UStranslation came back as the linked Anthropic blog post in English (~3.5x the original length) and thejatranslation came back entirely in Korean, paraphrasing the linked content rather than translating the source.The fix in ai/translate.ts swaps
fetchLinkedPages(mode"required") forfetchWebPage(mode"passive"). Vertana exposes passive sources as tools the model can call when it needs them, rather than pre-fetching everything into the prompt, so the model sees a clean translation context and only invokes a fetch when it decides to. Other context sources (author info, article tags) and Vertana options (refinement, dynamic glossary, technical tone) are unchanged.2. Stale translations after edit
Per #95, editing an article only updated the original-language
article_contentrow; existing translation rows kept their previously translated text and silently drifted out of sync with the original. Vertana doesn't support incremental translation yet, so the semantics here is "edit→reset each translation row to a placeholder pointing at the new original→re-run the full translator." Title-only edits don't trigger retranslation, mirroring the existing summary-invalidation gate in models/article.ts. Language changes can't reach the trigger when translations exist (the self-FK onarticle_contentrethrows 23503 asLanguageChangeWithTranslationsError).The trigger sits in two pieces.
updateArticleSourcenow returns an extraoriginalContentChangedflag set when the original-language row's body actually changed; the flag is captured inside the existing transaction alongside the existingresummarizeTarget. A new exportedrestartArticleContentTranslations(fedCtx, articleSource)does the actual work: re-fetches the latest original-language content, finds all translation rows, atomically resets each to placeholder state (copying the new original title/content into it, flippingbeingTranslatedback to true, clearing summary state), and then schedulesrunArticleContentTranslation(fedCtx, resetRow)against each.updateArticlereads the flag andawaits the synchronous claim-and-reset step so placeholders are visible by the time the mutation responds; the actualtranslate()calls run in the background, same as the initial translation flow. Each completed retranslation sends its own federationUpdateactivity—the right ActivityPub semantics: peers see the original-language change first and then each translation's refresh as it becomes available.Architecture notes
runArticleContentTranslationis extracted out ofstartArticleContentTranslationso it can be shared between the initial-queue path and the restart-on-edit path. The split is a pure refactor; the original return timing (sync setup awaited, translate chain in the background) is preserved.Date"claim" stamped onto the row at the top ofrunArticleContentTranslation. Both the success UPDATE and the failure DELETE require the row'supdatedto still equal that claim, so a concurrent retranslation that reset the placeholder under us turns our writes into no-ops instead of clobbering its fresher placeholder. A JSDate(rather thanCURRENT_TIMESTAMP) is what makes the CAS reliable: PGtimestamptzkeeps µs precision while thepostgresdriver hands back JSDatetruncated to ms, so a CAS against a round-tripped server-side timestamp would never match. This mirrors the claim pattern already used bystartArticleContentSummary.RETURNING, and the helper usesclaimed.title/claimed.contentfor the translate input rather than the caller-providedqueuedsnapshot, so a concurrent restart that just wrote a newer placeholder under us still gets translated against the newer body.Updateactivity ids for translation completions now include the target language and the translated row's own (just-bumped)updatedso multiple per-locale completions don't collide onarticle.updated, which translation completions don't bump.restartArticleContentTranslationsruns the read-original-then-reset-translations sequence inside a transaction whose first statement isSELECT 1 FROM article_source WHERE id=… FOR UPDATE. Concurrent restarts (driven by back-to-back edits to the same article) would otherwise each read their own snapshot of the original and overwrite each other's placeholder writes; the row lock—the same oneupdateArticleSourcetakes during its UPDATE—queues them up cleanly. The actualtranslate()calls are scheduled after the transaction commits so the LLM round-trip doesn't extend the lock window.Tests added
originalContentChangedflag flips on body changes and the existing translation row is preserved at this layer (the actual reset lives in theupdateArticlestep, not here); one asserts the flag stays false on title-only edits.restartArticleContentTranslations()test that asserts the placeholder reset happens with the new original title/content and the failing stub-translator subsequently cleans the row up, plus a no-op test for articles with no translation rows.updateArticle()body-change retranslation test plus a title-only no-op test.Verification
deno task test models/article.background.test.ts models/article.source.test.ts models/article.lifecycle.test.ts models/article.summary.test.ts graphql/post.more.test.tspasses clean (49 tests). Seven pre-existing failures in graphql/admin.test.ts, graphql/timeline.test.ts, models/post.delete.test.ts, and models/timeline.query.test.ts were verified against the merge base and are unrelated to this work.The new code went through external code review (Codex GPT-5.5) before opening this PR; the last two commits address the issues that surfaced.
Out of scope
fetchLinkedPagesbug. Site operators can re-request from the UI by re-editing the article, since the new code path triggers a fresh translation on body change.fetchLinkedPages. Tracked separately.updateArticleon every keystroke, since each call would now fan out N LLM translations.Fixes
Closes #95.